ABSTRACT

Community detection has arisen as one of the most relevant 
topics in the field of graph data mining due to its importance 
in many fields such as biology, social networks or network 
traffic analysis. The metrics proposed to shape communi- 
ties are generic and follow two approaches: maximizing the 
internal density of such communities or reducing the connec- 
tivity of the internal vertices with those outside the commu- 
nity. However, these metrics take the edges as a set and do 
not consider the internal layout of the edges in the commu- 
nity. We define a set of properties oriented to social networks 
that ensure that communities are cohesive, structured and 
well defined. Then, we propose the Weighted Community 
Clustering (WCC), which is a community metric based on 
triangles. We prove that analyzing communities by triangles
gives communities that fulfill the listed set of properties,
in contrast to previous metrics. Finally, we experimentally 
show that WCC correctly captures the concept of community
in social networks using real and synthetic datasets, and
compare statistically some of the most relevant community 
detection algorithms in the state of the art. 

1. INTRODUCTION 

Although graphs are a very intuitive representation of 
many datasets, retrieving information from them is far from 
easy. The rapid growth of datasets in recent years
has made it very difficult to intuitively extract and analyze
the information in the graphs generated from those data
sources. Large graphs often have so many relationships that
their visual analysis becomes impossible and their structural
components become difficult to understand.

Communities are informally defined as sets of vertices 
which are densely connected but scarcely connected to the 
rest of the graph. The retrieval of vertex communities (or
clusters) provides information about the sets of vertices that
respond to a similar concept [13]. In social networks,
communities identify groups of users with similar interests,
locations, friends or occupations. This information is useful
to craft new ways to represent data in visual analysis
applications [7] or to reduce the access times to this data thanks
to a more coalesced data placement [29].

Several metrics have been proposed as indicators of the 
quality of a community [16][21][26]. Among them, modularity
and conductance are those which have become more
popular [10] and precise [21], respectively. Modularity compares
the internal edge density of the community with the
average edge density of the graph. On the other hand, con- 
ductance computes the ratio between edges inside the com- 
munity and the edges in the frontier (i.e. the cut) of the 
community. Both metrics take the edges as a set of ob- 
jects without paying attention to the internal structure of 
the community. One consequence is that optimizing
such metrics generates communities without
noticeable structure that, empirically, sometimes cannot be
considered communities [21].

We find that the informal definition of community stated 
previously is too lax for social networks, because it does 
not consider the internal edge layout of the community. As 
a first contribution, we introduce a set of basic structural 
properties that a good community metric for social networks 
must fulfill. These properties ensure that communities are 
cohesive, structured and well defined. An example of these 
properties is that communities in social networks must be 
dense in terms of triangles. A triangle is a transitive relation
between three vertices. For example, a triangle appears
when A is a friend of B, B is a friend of C, and C is a
friend of A. Social networks are known to contain more triangles
than expected by chance (Erdős-Rényi graph), which
gives a community structure to the graph [24][27][31][33]. The
triangle is a simple structure that depicts a strong relation 
among three vertices. Furthermore, complex dense struc- 
tures such as a clique contain a large number of triangles. 
Another example is the absence of bridges in a community. 
A bridge is an edge that connects two connected compo- 
nents, and hence, having the two connected components as 
two different communities is more natural and intuitive [11] . 



Surprisingly, such requirements are not met by state of the 
art metrics, so they do not provide satisfactory communities. 

As a second contribution, we design a community detection
metric called Weighted Community Clustering (WCC).
WCC is based on the notion that triangles are a good in- 
dicator of community structure. WCC takes into account 
the density and the layout of the triangles to rate the qual- 
ity of a community. We also prove that our triangle based 
approach fulfills the introduced properties. 

Finally, we show experimentally that there is a correlation
between communities with good WCC value and desirable 
statistical values. We show that while communities with 
large WCC are cohesive and dense, others with good mod- 
ularity and conductance values are not. We also compare the 
most used algorithms in the state of the art using WCC. 

The paper is structured as follows: in Section 2 we review
the state of the art. In Section 3 we introduce the problem
of community detection, propose the new metric (WCC)
and introduce the properties. In Section 4 we show that
the current metrics in the state of the art do not fulfill the
properties proposed. Finally, in Section 5 we compare several
community detection methods using the WCC and in
Section 6 we give guidelines for future work and conclusions.

2. RELATED WORK 

There are basically two types of metrics to evaluate the 
quality of a community. First, those that focus on the inter- 
nal density of the community. The most widely used metric 
that falls into this category is the modularity, which was 
proposed by Newman et al. [26]. Modularity measures the
internal connectivity of the community (omitting the exter- 
nal connectivity) compared to an Erdos-Renyi graph model. 
It has become very popular in the literature, and a lot of al- 
gorithms are based on maximizing it. Algorithms apply sev- 
eral optimization procedures: agglomerative greedy [6], sim- 
ulated annealing strategy [23] or multistep approaches [3]. 

However, it has been reported that modularity has resolution
limits [11][15]. Communities detected by modularity
depend on the total graph size, and thus for large graphs, 
small well defined communities are never found. This means 
that maximizing the modularity leads to partitions where 
communities are far from intuitive. This is illustrated in 
Figure [3] by an example. 

The second type of metrics consists of those that focus 
on reducing the number of edges connecting communities. 
In [16], Kannan et al. introduce the conductance. Conductance
is the ratio between the edges going outside the
community and the total number of edges between members
of the community. However, conductance suffers from the
fact that for any graph, the partition with a unique community
containing all the vertices of the graph obtains the best
conductance, making its direct optimization not viable. A
recent survey [21] of community metrics discusses the performance
of many metrics on real networks: the cut ratio [10],
the normalized cut, the Maximum-ODF (Out Degree
Fraction), the Average-ODF and Flake-ODF [9]. In this survey,
Leskovec et al. showed that, among all these metrics,
conductance is the metric that best captures the concept of 
community. Furthermore, their results reveal that the qual- 
ity of communities decreases significantly for those of size 
greater than around 100 elements. 

3. WEIGHTED COMMUNITY CLUSTERING 



3.1 Problem Formalization 

Given a graph G = (V, E), the problem is to classify the
vertices of the graph into disjoint cohesive sets. The criterion
to measure the cohesion of the sets is formally obtained by
defining a metric, that is, a function f that assigns to each
subset S of V a real number such that 0 ≤ f(S) ≤ 1. A
community is a set of vertices S, on which we compute a
degree of cohesion f(S). Good communities have a large
f(S) and bad communities have a small f(S). The adequate
metric f for a given context (social networks, biology, etc.)
captures the features of the communities in that context.

A partition of V is a set P = {C_1, ..., C_n} of non-empty
pairwise disjoint subsets of V such that C_1 ∪ ... ∪ C_n = V.
A metric f in G defines in a natural way a value f(P) in
each partition P of V by taking the weighted average of the
value of the function on the sets of the partition:

\[ f(P) = \frac{1}{|V|} \sum_{i=1}^{n} |C_i| \cdot f(C_i). \qquad (1) \]

For a given graph and a given metric f in G, the goal is to
obtain an optimal partition, that is, a partition P such that
f(P) is maximum. We call the communities in an optimal
partition the optimal communities of the graph.

3.2 Metric Definition 

A natural way to define the cohesion of a community is to
define first the degree of cohesion of a vertex x with respect
to a set S. That is, a function f that assigns to the pair
(x, S) a real number f(x, S) in the range 0 ≤ f(x, S) ≤ 1.
Then the metric on S is defined by taking the average of
f(x, S) with x ∈ S, that is,

\[ f(S) = \frac{1}{|S|} \sum_{x \in S} f(x, S). \qquad (2) \]

In this paper, we propose a definition of metric f(S) that we
call Weighted Community Clustering (WCC), which computes
the level of cohesion of a set of vertices S. In the rest
of the paper, we refer to our proposal for f(S) as WCC(S).

In order to define WCC(S) we start by defining WCC(x, S).
With that objective, we denote by t(x, S) the number of triangles
that vertex x closes with vertices in S and by vt(x, S)
the number of vertices of S that form at least one triangle
with x. WCC(x, S) is calculated as follows:

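\[ WCC(x, S) = \begin{cases} \dfrac{t(x, S)}{t(x, V)} \cdot \dfrac{vt(x, V)}{|S \setminus \{x\}| + vt(x, V \setminus S)} & \text{if } t(x, V) \neq 0, \\ 0 & \text{if } t(x, V) = 0. \end{cases} \qquad (3) \]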

Note that |S \ {x}| + vt(x, V \ S) = 0 implies that S = {x} and
vt(x, V) = 0. Then the condition |S \ {x}| + vt(x, V \ S) = 0
is included in the condition t(x, V) = 0.

The left fraction of WCC(x, S) is the ratio of triangles
that vertex x closes with set S, as opposed to the number of 
triangles that x closes with the whole graph. On the other 
hand, the right fraction is the number of vertices that close 
at least one triangle with x, with respect to the union of 
such set and S. 
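The definitions above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the helper names (`build_graph`, `t`, `vt`, `wcc_vertex`, `wcc_set`, `wcc_partition`) are hypothetical, the partition score is assumed to be the size-weighted average of the community scores, and vt(x, S) is assumed to count the vertices of S that share at least one triangle with x, wherever the third vertex of that triangle lies. The toy graph is two 5-cliques joined by a single edge.

```python
from itertools import combinations


def build_graph(edges):
    """Return an adjacency dict {vertex: set of neighbours} (undirected, no self loops)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def t(adj, x, S):
    """Number of triangles that x closes with two other vertices of S."""
    nbrs = (adj[x] & set(S)) - {x}
    return sum(1 for y, z in combinations(nbrs, 2) if z in adj[y])


def vt(adj, x, S):
    """Number of vertices of S sharing at least one triangle with x.
    Assumption: the third vertex of the triangle may be anywhere in the graph."""
    return sum(1 for y in (adj[x] & set(S)) - {x} if adj[x] & adj[y])


def wcc_vertex(adj, x, S):
    """Cohesion of vertex x with respect to the set S (two-factor product)."""
    V, S = set(adj), set(S)
    total = t(adj, x, V)
    if total == 0:
        return 0.0
    denom = len(S - {x}) + vt(adj, x, V - S)
    return t(adj, x, S) / total * vt(adj, x, V) / denom


def wcc_set(adj, S):
    """Average vertex cohesion inside the community S."""
    return sum(wcc_vertex(adj, x, S) for x in S) / len(S)


def wcc_partition(adj, partition):
    """Size-weighted average of the community scores (assumed weighting)."""
    return sum(len(C) * wcc_set(adj, C) for C in partition) / len(adj)


# Toy graph: two 5-cliques (vertices 0-4 and 5-9) joined by the edge (4, 5).
edges = list(combinations(range(5), 2)) + list(combinations(range(5, 10), 2)) + [(4, 5)]
graph = build_graph(edges)
print(round(wcc_partition(graph, [set(range(10))]), 3))                  # 0.444
print(round(wcc_partition(graph, [set(range(5)), set(range(5, 10))]), 3))  # 1.0
```

Under these assumptions the merged graph scores 4/9 and the split into two cliques scores 1, because in the merged case every vertex forms triangles with only 4 of the 9 other members.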

The cohesion of a partition is computed as stated in Equation (1),
by using WCC(S). Therefore, an optimal partition
is such that, for all vertices of the graph, the two factors of
Equation (3) are optimized. The left fraction is maximized
for a vertex x when set S includes all the vertices that form
triangles with x. Note that since a pair of vertices can build
many triangles, the left term rewards including the vertices
that build more triangles with x. The right fraction is maximized
for x when set S contains no vertex with which x
forms no triangle. The maximization process is a compromise
between both terms: the left term is optimized by
including additional vertices in the set, but the right term is
optimized by removing vertices from the set. This behavior
implies that good communities are those with a significant
number of triangles well distributed among all the vertices.

Proposition 1 introduces a set of natural properties of
WCC(x, S) (proofs are available in Appendix A).

Proposition 1. Let G = (V, E) be a graph and ∅ ≠ S ⊆
V. Then,

(i) 0 ≤ WCC(x, S) ≤ 1 for all x ∈ V.

(ii) WCC(x, S) = 0 if and only if t(x, S) = 0.

(iii) WCC(x, S) = 1 if and only if vt(x, S) = |S \ {x}| ≥ 2
and vt(x, V \ S) = 0.

The value of WCC(x, S) indicates the fitness of vertex x
to become part of the set of vertices S. This value is a
real number between 0 and 1 (Proposition 1 (i)). These
two extremes are only seen in particular situations
(Proposition 1 (ii-iii)). For a given vertex x, in order to have some
degree of cohesion with a set S, the vertex must at least form
one triangle with two other vertices in set S. If a vertex 
builds no triangle with the vertices in S, then the cohesion 
of the vertex with respect to the set is zero. On the other 
hand, the value one is reached if and only if all the vertices of 
S form at least one triangle with x. This property reflects 
the fact that the cohesion of a vertex x with respect to a 
set S, is maximized when S includes exactly all the vertices 
that close triangles with x. Furthermore, from the point of 
view of the WCC, only those edges in E closing at least one 
triangle are relevant and influence the cohesion of a vertex. 

WCC(S) indicates the quality of a community. We infer
several properties on WCC(S) from Proposition 1 (see
proofs in Appendix B).

Proposition 2. Let G = (V, E) be a graph and ∅ ≠ S ⊆
V. Then,

(i) 0 ≤ WCC(S) ≤ 1.

(ii) WCC(S) = 0 if and only if S has no triangles.

(iii) WCC(S) = 1 if and only if S is a clique with
vt(x, V \ S) = 0 for all x ∈ S.

The clique is the subgraph structure that best resembles 
the perfect community, and thus, WCC rates it with the 
largest value. On the other hand, if the community has no 
triangles, its quality is the minimum possible. In Figure 1(a-d),
we show a community of five vertices with an increasing
number of internal triangles. The larger the triangle
density, the larger the WCC(S) value of the community.



[Figure 1: Property examples. Panels with the WCC value printed
next to each, where recoverable. Example: (a), (b) 0.7, (c) 0.9, (d) 1.
Property 1, Internal Structure: (e), (f) 0.667.
Property 2, Linear Community Cohesion: (g) 0.833, (h) 0.860, (i) 0.881, (j) 0.814.
Property 3, Bridges: (k) 0.444, (l) 1.
Property 4, Cut Vertex Density: (m) 0.556, (n) 0.722, (o) 0.444.]



3.3 Properties 

In this section, we introduce a set of basic properties that 
any community cohesion metric for social networks should 
fulfill. We verify them for WCC, proving that WCC is
a good candidate to distinguish communities in social
networks (proofs are given in Appendices C-E):

Property 1: Internal Structure. In several previous
studies [4][27], it has been shown that one of the main
characteristics of social networks is the presence of a large
clustering coefficient and communities. Social networks have
more triangles than expected in random graphs [24][27][31][33]
and models describing the growth of social networks give
triangle closing as a key factor of network evolution [20]. Thus,
we take triangles as the indicator of the presence of
community structures. Then, the cohesion of a community
given by a community metric for social networks
depends on the triangles formed by the edges inside the
community. We verify this property for WCC: the left
factor in Equation (3) is the ratio of the number of triangles
the vertex x forms with the vertices in S as opposed to
the number of triangles the vertex x forms with the whole
graph. Hence, the factor is affected by the number of
triangles inside the community. On the other hand, the right
factor depends on the number of vertices that form triangles
with vertex x. Therefore, the distribution of the triangles
inside the community affects the right factor. Figure 1(e-f)
shows an example of two partitions with the same number of
edges, but distributed differently. We see that the vertices
in Figure 1(e) form no triangles, which translates to a value
of WCC = 0. On the other hand, the vertices in Figure 1(f)
form four triangles, obtaining a larger value of WCC. We
see that WCC reacts to the internal structure of the
communities, and in particular to the presence of triangles.

Property 2: Linear Community Cohesion. An interesting
aspect to consider is the dynamics of community formation
in social networks: what happens when a new vertex appears
in a graph with an existing community and creates links with
the members of that community? In order to keep high quality
communities, these must grow with cohesion. This means that
a vertex can only join a community if it has a significant
number of links with the members of the community. The larger
the community, the more links are needed. Otherwise, the cohesion
of the community decreases. This simple restriction limits
the community growth if there is not a significant cohesion
among its members. Therefore, the number of connections
needed between a vertex x and a set S, so that
f(S ∪ {x}) > f(S, {x}), grows linearly with respect to
the size of S. If it grew sublinearly, the larger a community
is, the easier it would be for a vertex
to join the community relative to the community size. On
the other hand, if it grew faster than linearly, the communities
would have a maximum possible size.

Theorem 1 proves this requirement for WCC:

Theorem 1. Let G = (V, E) be a random graph of order
r in which each edge occurs independently with probability
p. Let v ∉ V be a vertex adjacent to d ≥ 2 vertices of
V. Consider the two partitions P_1 = {V ∪ {v}} and P_2 =
{V, {v}}. Then,

(i) $(r + 1) \, WCC(P_1) = (r - 1)p + \dfrac{2d}{r}$.

(ii) $(r + 1) \, WCC(P_2) = (r - d)p + \dfrac{d}{r} \cdot \dfrac{((r - 1)p + 1)(r - 1)(r - 2)p^2}{(r - 1)(r - 2)p^2 + 2(d - 1)}$.

(iii) For large enough r, WCC(P_1) > WCC(P_2) if and
only if $d > rp \left( \sqrt{p^2 + 2p + 9} - (1 + p) \right) / 4$.

For instance, in the particular case of the clique (where p = 
1), it is necessary to connect to roughly more than one third 
of the vertices to become a member of the community. 

Corollary 1. Let S be a clique of order r. Given a
vertex v, there must exist at least 0.37 · r edges between v
and S to hold WCC(S ∪ {v}) > WCC(S, {v}).

In Figure 1(g-j) we show an example of Theorem 1, where
colors indicate different communities, and dashed lines
represent edges between communities. In (g) and (i), the whole
graph is a community, but in (h) and (j) the graph is split
into two communities. When the external vertex has only
two connections with the six vertices, the metric favors
keeping the vertex outside of the community. However,
when the number of connections is three, WCC has a
better value when the vertex is included in the community.
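The linear threshold can be checked numerically. The sketch below assumes the expected-value expressions (r + 1)·WCC(P_1) = (r - 1)p + 2d/r and (r + 1)·WCC(P_2) = (r - d)p + (d/r)·((r - 1)p + 1)(r - 1)(r - 2)p² / ((r - 1)(r - 2)p² + 2(d - 1)), and the asymptotic bound d > rp(√(p² + 2p + 9) - (1 + p))/4. The function names are hypothetical and chosen only for illustration.

```python
import math


def score_joined(r, p, d):
    """(r + 1) * WCC(P_1): the new vertex v is placed inside the community."""
    return (r - 1) * p + 2 * d / r


def score_split(r, p, d):
    """(r + 1) * WCC(P_2): the new vertex v is kept as a singleton."""
    inner = (r - 1) * (r - 2) * p ** 2
    return (r - d) * p + d / r * ((r - 1) * p + 1) * inner / (inner + 2 * (d - 1))


def asymptotic_fraction(p):
    """Fraction d / r above which joining pays off for large r."""
    return p * (math.sqrt(p * p + 2 * p + 9) - (1 + p)) / 4


def smallest_joining_degree(r, p):
    """Smallest d >= 2 for which the joined partition scores higher, or None."""
    for d in range(2, r + 1):
        if score_joined(r, p, d) > score_split(r, p, d):
            return d
    return None


print(round(asymptotic_fraction(1.0), 3))    # 0.366, the clique case
print(smallest_joining_degree(1000, 1.0))    # 367
```

For p = 1 the asymptotic fraction is (√12 - 2)/4 ≈ 0.366, in line with the 0.37 · r of Corollary 1, and for r = 1000 the finite-size expressions cross between d = 366 and d = 367.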

Property 3: Bridges. A bridge is an edge whose removal
from the graph creates two connected components.
The connections in real graphs are known not to be local,
but can connect distant vertices [22]. A bridge is a very
weak relation between two sets of vertices that are otherwise
unrelated, because it only affects one member of each set.
Therefore, an optimal community in social networks
cannot contain a bridge. We prove that WCC is
resistant to bridges in the following theorem:



Theorem 2. Let S_1 and S_2 be two communities in a
partition of graph G = (V, E) such that:

(i) S_1 and S_2 are the sets of vertices of two different
connected components.

(ii) WCC(S_1) > 0.

Then, the following inequality holds:

WCC({S_1, S_2}) > WCC({S_1 ∪ S_2}).

An edge that does not close any triangle does not affect
the computation of WCC because it alters no terms in
Equation (3). A bridge is a particular case of such an edge,
and therefore, it does not affect the quality of a partition for
WCC. Since given two connected components it is better to
have them separated than merged into a single community,
an optimal community cannot contain a bridge,
because there exists a partition with a better cohesion formed
by the two separated components. In Figure 1(k-l) we show
an example of the application of Theorem 2. We see that
having the two cliques separated is better than considering
a single community with a bridge, in terms of WCC.
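As a worked instance, consider two 5-cliques S_1 and S_2 joined by a single bridge. That the figure depicts exactly this graph is an assumption, made because it reproduces the reported values 0.444 and 1. Every vertex x closes 6 triangles, all inside its own clique and involving its 4 clique neighbours, and the bridge closes no triangle, so no vertex outside the clique shares a triangle with x:

```latex
\begin{align*}
WCC(x, S_1 \cup S_2) &= \frac{t(x, S_1 \cup S_2)}{t(x, V)} \cdot
  \frac{vt(x, V)}{|(S_1 \cup S_2) \setminus \{x\}| + vt(x, \emptyset)}
  = \frac{6}{6} \cdot \frac{4}{9 + 0} = \frac{4}{9} \approx 0.444, \\
WCC(x, S_1) &= \frac{t(x, S_1)}{t(x, V)} \cdot
  \frac{vt(x, V)}{|S_1 \setminus \{x\}| + vt(x, V \setminus S_1)}
  = \frac{6}{6} \cdot \frac{4}{4 + 0} = 1 \quad \text{for } x \in S_1 .
\end{align*}
```

Since all ten vertices behave identically, averaging leaves the merged partition at 4/9 and the split partition at 1.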

Property 4: Cut Vertex Density. A cut vertex is a vertex
whose removal disconnects the graph into two or more
connected components. A cut vertex is certainly a weak
link in a community formed by the union of the two sets
it connects, because the vertices of the two sets have no
relation among them. If the two sets have no other connection
between them than the cut vertex, they must
be considered as independent communities on their own if
they have enough cohesion internally. Therefore, an optimal
community cannot contain a cut vertex if the
sets that it separates have a minimum density¹. In
Figure 1(m-o), we show two cliques (note that the clique is
the highest density graph structure) of size five sharing a
vertex. Here, WCC is able to separate the communities for
this particular case because the red and blue sets of vertices
have enough cohesion to become separate communities. We
prove this property for WCC for the case where communities
have the highest possible density, which is the clique:

Theorem 3. Let G = (V, E) be a graph of order n which
consists of two cliques K_r and K_s of orders r and s, respectively,
that intersect in a vertex t. Assume r ≥ s ≥ 4.

(i) If P_1 = {K_r ∪ K_s}, then

\[ n \cdot WCC(P_1) = \frac{(r-1)^2}{r+s-2} + \frac{(s-1)^2}{r+s-2} + \frac{r+s-2}{r+s-2}. \qquad (4) \]

(ii) If P_2 = {K_r, K_s \ {t}}, then

\[ n \cdot WCC(P_2) = (r-1) + \frac{(r-1)(r-2)}{(r-1)(r-2) + (s-1)(s-2)} + \frac{(s-1)(s-2)(s-3)}{(s-1)(s-2)}. \qquad (5) \]

(iii) If P_3 = {K_r \ {t}, {t}, K_s \ {t}}, then

\[ n \cdot WCC(P_3) = \frac{(r-1)(r-2)(r-3)}{(r-1)(r-2)} + \frac{(s-1)(s-2)(s-3)}{(s-1)(s-2)}. \qquad (6) \]

(iv) WCC(P_3) < WCC(P_2).

(v) max{WCC(P_1), WCC(P_2), WCC(P_3)} = WCC(P_2).

¹ A cut vertex can be seen as an example of an overlapped
community. However, it is not the aim of this work to
consider the problem of overlapping communities.



This theorem illustrates the fact that WCC avoids merging
two very well defined communities (such as two cliques)
because of a single vertex. The reason is that WCC is a metric
that not only takes into account the vertices that are
connected and form triangles, but also the vertices that do
not. Thus, if the triangles inside the community are not
distributed evenly among all the vertices, then the quality of
the community is penalized.
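The three candidate partitions of two cliques sharing a vertex can be compared with exact arithmetic. The closed forms in this sketch were derived directly from the definition of WCC(x, S), assuming that vt counts vertices sharing at least one triangle with x wherever the third vertex lies; they are an illustration, not a transcription of the theorem, and the function name `shared_vertex_scores` is hypothetical.

```python
from fractions import Fraction


def shared_vertex_scores(r, s):
    """Return (WCC(P_1), WCC(P_2), WCC(P_3)) for cliques K_r and K_s sharing one vertex."""
    n = r + s - 1                      # the shared vertex is counted once
    a = (r - 1) * (r - 2)              # twice the triangles of t inside K_r
    b = (s - 1) * (s - 2)              # twice the triangles of t inside K_s
    # P_1: everything merged; non-shared vertices reach only their own clique.
    merged = Fraction((r - 1) ** 2 + (s - 1) ** 2 + (r + s - 2), r + s - 2)
    # P_2: K_r keeps the shared vertex, K_s loses it.
    kept = (r - 1) + Fraction(a, a + b) + (s - 3)
    # P_3: the shared vertex is a singleton and scores zero.
    isolated = Fraction((r - 3) + (s - 3))
    return merged / n, kept / n, isolated / n


p1, p2, p3 = shared_vertex_scores(5, 5)
print(p1, p2, p3)                       # 5/9 13/18 4/9
print([round(float(x), 3) for x in (p1, p2, p3)])  # [0.556, 0.722, 0.444]
```

For r = s = 5 the three scores are 5/9, 13/18 and 4/9, so keeping the shared vertex in one clique and leaving the other clique on its own is the best of the three options.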

3.4 Examples of WCC on Communities

Figure 2 shows some examples of communities with different
values of WCC. These communities are extracted randomly
from the set of communities found in the real graphs
used by the algorithms in Section 5. The color of the vertices
represents the percentage of neighbors belonging to the
community. The darker the vertex, the larger the percent- 
age of neighbors of the vertex that belong to the commu- 
nity. On the other hand, the size of the vertices represents 
the percentage of vertices of the community that are actual 
neighbors of that vertex. The larger the size of the vertex, 
the more connected the vertex is with the other vertices of 
the community. In other words, the color represents the size 
of the edge cut that disconnects the vertex from the rest of 
the graph, while the size represents the density of edges of 
the vertex that connects it with other vertices in the com- 
munity. Thus, the better the community is, the larger and 
darker are its vertices. In the figure, we see that the larger 
the WCC of the community, the larger and darker are the 
vertices of the community, which means that they are more 
densely connected and better isolated from the rest of the 
graph. We see then, that there is a correlation between high 
WCC values and good communities. 

4. COMPARISON WITH OTHER METRICS 

The properties show that WCC is a metric capable of fa- 
voring those communities with a large quantity of triangles 
involving all the vertices of the community. We are ensur- 
ing, like the informal community definition says, that all the 
vertices forming the community are highly connected among 
them. This property is not fulfilled by other proposed met- 
rics, such as the conductance. Compared to WCC, which 
is based on triangles, conductance is based on the edge cut. 
Minimizing the cut is problematic because the partition con- 
sisting of a single community containing all the vertices of 
the graph has the best value, and thus it is the optimal
community. This makes it impossible to design algorithms that
simply optimize conductance. Properties 1, 2 and 3 do not 
apply to conductance, because as soon as there is an edge 
connecting two communities or a single vertex connecting 
to a community, joining them into a single community will 
grant a better conductance value. 




[Figure 2: Examples of communities from real graphs, sorted by WCC.
Recoverable panel values: (b) 0.14191, (e) 0.52118, (f) 0.65128,
(g) 0.78072, (h) 0.92798.]




Figure 3: Ring with 24 cliques of 5 vertices each 
(shaded circles). Setting each clique as a community 
has a modularity of 0.8674, but merging adjacent 
cliques has modularity 0.8712 [15].



In the case of modularity, it suffers from resolution
limits [11][15]. This resolution problem is exemplified in
Figure 3, where the optimal communities for modularity are
groups of two cliques. In this example, the communities 
found by optimal modularity contain a bridge, and thus they 
do not verify Property 3. However, the natural communi- 
ties are the groups of five vertices forming cliques, which 
are the optimal communities for WCC. The WCC value 
of the five vertices clique is one, so having a partition with 
each clique as a community has the maximum WCC value. 
Furthermore, it has been shown [2] that trees, which cannot
be considered communities, can have arbitrarily large
modularity. We show that WCC is a metric that sees the
communities in a local fashion, focusing on the internal
density and the connections with their surroundings instead of
the whole graph. Modularity assumes that graphs are ho- 
mogeneous, whereas they are not. 
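The two modularity values quoted for the ring of cliques can be reproduced with a short sketch. It assumes the standard Newman-Girvan form of modularity, Q = Σ_c [l_c/m - (d_c/2m)²], with l_c the internal edges and d_c the total degree of community c, and a ring in which consecutive cliques are linked by exactly one edge. Neither the formula nor the helper names below are spelled out in the text.

```python
from itertools import combinations


def ring_of_cliques(n_cliques, k):
    """Edge list of n_cliques k-cliques arranged in a ring, one edge between neighbours."""
    edges = []
    for c in range(n_cliques):
        base = c * k
        edges += [(base + i, base + j) for i, j in combinations(range(k), 2)]
        next_base = ((c + 1) % n_cliques) * k
        edges.append((base + k - 1, next_base))  # link to the next clique
    return edges


def modularity(edges, communities):
    """Newman-Girvan modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    label = {v: idx for idx, com in enumerate(communities) for v in com}
    internal = [0] * len(communities)
    degree = [0] * len(communities)
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(l / m - (d / (2 * m)) ** 2 for l, d in zip(internal, degree))


edges = ring_of_cliques(24, 5)
single = [set(range(c * 5, c * 5 + 5)) for c in range(24)]
paired = [set(range(c * 5, c * 5 + 10)) for c in range(0, 24, 2)]
print(round(modularity(edges, single), 4))  # 0.8674
print(round(modularity(edges, paired), 4))  # 0.8712
```

Merging adjacent cliques raises modularity even though each merged community then contains a bridge.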

5. EXPERIMENTS 

In this section, we first select some of the most relevant 
community detection algorithms and analyze them. We use 
synthetically generated graphs from which we know the com- 
munities beforehand, to show that WCC favors those com- 
munity detection algorithms that best capture the actual 
communities. We also execute the algorithms on graphs 
from real world data. We prove experimentally that WCC 
captures the nature of a community by studying the correla- 
tion of WCC with statistical properties of the communities. 

The algorithms used to extract the communities are
Infomap [30], which is based on random walks; Blondel [3],
which is based on multilevel maximization of modularity
locally; Clauset [6], which maximizes the modularity
iteratively; Newman [25], which maximizes modularity by
exploiting the spectral properties of the graph; and Duch [8],
which uses heuristic search based on extremal optimization
to optimize the modularity.

We choose Infomap and Blondel because they are reported
to be the best for detecting communities in social networks.
Clauset, Newman and Duch are chosen because of
their popularity in the literature. The implementations of
Infomap, Clauset and Blondel are taken from their authors'
websites. In the case of Newman and Duch, we have used the
Radatools library [14]. Our selection covers a wide range
of community methods to test the validity of WCC but
does not intend to be an evaluation survey of all community
methods. Besides, other popular approaches in the literature,
such as [1][19][28] among others, aim at overlapping
communities, which are also out of the scope of this paper.

5.1 Synthetic Graphs 

In this section, we use synthetically generated graphs, 
where the communities are known beforehand. We build 
graphs of 10k vertices with social network topology with a
generator [18]. We use the default parameters, which are
typical of social networks (used also in [17]). We vary the
mixing factor, which is the percentage of edges that connect 
a vertex with other vertices outside the community, from 0.1 
to 0.7. 

The quality of the result for each algorithm is measured by 
the normalized mutual information (NMI), which computes 
the overlap between the algorithm output and the
benchmark [12], and it is shown in Figure 4(a). We see that
Infomap stands as the best algorithm, followed by Blondel and
Duch at a small distance. We see that the larger the mixing
factor, the harder it is for the algorithms to correctly find
the communities.

On the other hand, in Figure 4(b) we show the WCC for
each algorithm. In this case, the best algorithm is again 
Infomap. Blondel and Duch perform slightly worse than 
Infomap, and Clauset and Newman stand as the worst algo- 
rithms of all in terms of WCC. We observe that WCC is 
a good model for the communities built by the benchmark. 
Those algorithms with high NMI also have high WCC. 

We quantify the correlation of both metrics in Figure 4(c).
We apply Kendall's rank correlation coefficient [5], which
compares two relations of order. A value of 1 indicates that
the two metrics are fully correlated, while 0 indicates that
no correlation is found. If the two metrics are inversely
correlated the value is -1.

In Figure 4(c), we observe that the agreement of both
metrics is excellent. For all mixing factors, except for 0.2
and 0.6, the Kendall's tau correlation coefficient is 1, which
means that the correlation is perfect. Only one pairwise
comparison for the 0.2 and 0.6 mixing factors is reversed, which
corresponds to the pair Blondel-Duch (see Figures 4(a) and
(b)). However, in both cases the difference in NMI between
both methods is tiny (less than 1%), and thus, it is difficult
to discern which community partition is better. Moreover,
for all mixing factors the Kendall significance test
(significance 0.05) concludes that there is statistical evidence that
both variables are correlated.
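The effect of a single reversed pair on Kendall's tau can be illustrated with a small sketch. The scores below are hypothetical placeholders, not the measured NMI or WCC values, and the seven mixing factors assume steps of 0.1 between 0.1 and 0.7; the tie-free form tau = (concordant - discordant) / (number of pairs) is used.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Kendall's tau for two equally long score lists, assuming no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical scores for five algorithms; only the second and third
# entries are ranked differently by the two metrics.
nmi = [0.95, 0.90, 0.89, 0.60, 0.55]
wcc = [0.40, 0.35, 0.36, 0.20, 0.15]
tau = kendall_tau(nmi, wcc)
print(tau)  # 0.8

# Five mixing factors in perfect agreement, two with one reversed pair.
average = (5 * 1.0 + 2 * tau) / 7
print(round(average, 2))  # 0.94
```

With five algorithms there are ten pairwise comparisons, so one reversed pair lowers tau from 1 to 0.8, and two such mixing factors out of seven give an average of about 0.94.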

Overall, the average Kendall correlation is 0.94, which is
very high. In other words, the synthetic community
generation procedure, which is not based on triangles, generates
communities that match our community definition based on
WCC. Therefore, we conclude that WCC captures
community structure effectively and that WCC is an adequate
indicator of the quality of the communities found in a graph.

              ArxivCit   ArxivCol   Enron       Slashdot
Vertices      27,769     18,771     (missing)   82,168
Edges         352,285    198,050    183,831     504,230
Avg. degree   25.37      21.1       10.02       12.27
Max. degree   2,468      504        1,383       2,552

Table 1: The real world graphs used for testing.

5.2 Real World Networks 

In this section, we show that there is a correlation
between communities with good WCC values and good
statistics. We study the following measures: triangle density,
which is the number of internal triangles in the community 
divided by the total number of possible internal triangles; 
the average inverse edge cut, which is the average number
of neighbors of a vertex that belong to the same community 
divided by the total number of neighbors; the average edge 
density, which is the average number of neighbors that a 
vertex has in the community divided by the total number 
of members of the community; the modularity; the conduc- 
tance; the normalized diameter, which is the diameter of the 
community divided by the logarithm of its size; the bridge 
ratio, which is the percentage of edges in a community that 
are bridges; and the vertex size of the communities. 

We create a pool of communities by running the community
detection algorithms on four real world networks,
covering different aspects of real world data². ArxivCit is
a citation network, ArxivCol represents the collaborations
between scientists, Enron is derived from email
communications and Slashdot is extracted from a website social
network. Table 1 summarizes the graph properties.

We sorted all the communities obtained by the five
algorithms by their WCC value in decreasing order. Then, we divided
the communities into 20 groups in steps of five percentiles
according to their WCC and plotted the corresponding
statistics for these 20 groups in Figure 5. In all the charts,
the x axis represents the group identifier (e.g. the leftmost
group is always the 95th percentile, which contains the top 5%
of communities in terms of their WCC), while the y axis shows
the corresponding statistical value. The communities of size
one and two are omitted since their WCC(S) value is
always zero. As shown in Figure 5(a), the leftmost
communities have high WCC values, and the rightmost communities
have the lowest WCC values.
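
The grouping step can be sketched as follows. This is a minimal illustration under assumptions: communities are given as hypothetical (size, WCC) pairs, and the helper name `wcc_percentile_groups` is ours, not part of the experimental setup.

```python
def wcc_percentile_groups(communities, n_groups=20):
    """Split communities into n_groups groups ordered by decreasing WCC.

    `communities` is a list of (size, wcc) pairs; communities of size one
    and two are dropped because their WCC is always zero.
    """
    ranked = sorted((wcc for size, wcc in communities if size >= 3), reverse=True)
    total = len(ranked)
    # Group g holds the values ranked in the g-th slice of five percentiles.
    return [
        ranked[g * total // n_groups:(g + 1) * total // n_groups]
        for g in range(n_groups)
    ]

# Hypothetical pool: 100 communities of size 5 plus two tiny ones that are dropped.
pool = [(5, i / 100) for i in range(100)] + [(1, 0.0), (2, 0.0)]
groups = wcc_percentile_groups(pool)
```

Here `groups[0]` corresponds to the leftmost group in the charts (the top 5% of communities by WCC) and `groups[19]` to the rightmost one.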

Broadly speaking, we observe two sections: from groups 
1 to 12, the trends for all statistics show that communities 
with higher WCC have better properties; from groups 13 to 
20 this trend apparently changes in some statistics. We focus 
first on groups 1-12 and we analyze groups 13-20 below. 

Groups 1-12: In Figure 5(b), we see that the larger the
WCC, the smaller the edge cut, so the number of external
connections of the community decreases. On the other hand,
in Figure 5(c) we see that the larger the WCC of a
community, the larger the internal density of edges. While these
two characteristics are a good starting point to identify a
good community, they do not imply an internal structure,
which is shown in Figure 5(d): the larger the WCC of a
community, the larger its triangle density. These transitive
relations between the vertices (Property 1) indicate a good
social structure of the communities.

The graphs were downloaded from SNAP (http://snap.stanford.edu).
We cleaned the original graphs by removing the self loops.

Figure 4: (a) NMI value and (b) WCC value for the most relevant state of the art algorithms on synthetic
graphs with different mixing factors, (c) Kendall's tau correlation coefficient between NMI and WCC.

Figure 5: Statistics of communities from real world networks in 20 groups sorted by WCC.

Figure 5(e) shows how bridges penalize WCC. A large
percentage of bridges is a symptom of the presence of whiskers
or treelike structures, which are inherently sparse and hence
do not have the type of internal structure that one would
expect from a community. We note that communities that
contain bridges are not the optimal communities because of
Property 3. A small diameter is another feature that any
good community should have. In Figure 5(f) we see that
large WCC implies smaller diameters for the communities.
This means that any vertex in the community is close to any
other vertex, which translates to denser communities.

In Figure 5(g-h) we compare WCC with the two other
metrics in the state of the art: modularity and conductance.
We see that there is a correlation between communities with
good WCC values and modularity and conductance (for
conductance, the lower, the better). Finally, in Figure 5(i),
we show the sizes of the communities.

Groups 13-20: We see that there is a change in the trend
for some statistics for those groups that have WCC close to
0. This behavior can be explained by Figures 5(d) and (e).
These figures reveal that the communities after group 13 are
treelike: almost all the edges in the community are bridges,
and communities hardly contain triangles. Therefore, such
structures cannot be accepted as good communities.
Although some communities in groups 13-20 are isolated (and
thus have good conductance), we note that this is not a
sufficient condition to be good communities. For example, most
communities in groups 17-20 are trees with three vertices,
which have a good conductance. As described in [2], treelike
networks can have high modularity and hence, algorithms
maximizing it can lead to misleading results (Figure 5(g)).

Figure 6: WCC for communities from real graphs.

Finally, in Figure 6 we compare the different algorithms
used in terms of WCC. We see that the results obtained
are similar to those obtained with synthetic graphs, with
Infomap outperforming the rest of the algorithms. However,
we see that in this case, Clauset performs at a level
comparable to Blondel and Duch or even better. This might
indicate that synthetic graphs fail to accurately represent the
inhomogeneities and noise present in real graphs, so using
only these graphs when evaluating the quality of community
detection algorithms can lead to misleading results.

6. CONCLUSIONS AND FUTURE WORK 

Although different metrics have been previously proposed
to evaluate the quality of a graph partition into communities
for social networks, these fulfill only partially the concept of
community. Even the most popular metrics applied in the
state of the art (modularity and conductance) fail to meet
some minimal properties expected of social network
communities, such as dealing with bridges, vertex cuts,
scalability of the community formation or imposing a minimal
internal structure such as the triangle. The reason for this is
that the current metrics are based on the informal definition
of a community, which is unable to fully capture the
community concept for social networks on its own. We conclude
that the concept of community is strongly dependent on the
domain of the graphs being analyzed, which is something
that the current metrics do not take into account. This
suggests that the definition of a minimal set of properties
extending this informal definition is required, in order to
exploit the inherent characteristics of the type of graph being
analyzed and its semantics (social networks in our paper).

In this paper, we proposed WCC, which compares the
quality of two graph community partitions. Such a metric
captures the community concept by meeting the enumerated
minimal properties, making it possible to distinguish a good from a
bad community partition automatically. Then, it is
possible to compare the quality of algorithms or even design
efficient community computing algorithms based on WCC.
We have shown experimentally that communities with good
WCC values are dense, have small edge cuts, have transitive
relations without bridges and small diameters. We have also
shown that looking only at the internal density and small
edge cuts does not guarantee well defined communities with
internal structure, since it can lead to treelike communities.

Regarding future work, an interesting problem related
to that discussed in the paper is the detection of overlapping
communities in the graph. Some graph patterns, such as cut
vertices, can be naturally modeled as the overlap of two
communities. The overlapping community problem has
deficiencies similar to those of the non-overlapping case, in the sense
that there is no formalization of a minimal set of properties
that a metric should fulfill. Therefore, our work will
continue toward extending the community definition concepts
and WCC to the detection of overlapping communities.
APPENDIX

A. PROOF OF PROPOSITION 1 

Proof. (i) This is a consequence of the inequalities
$t(x,S) \le t(x,V)$ and

\begin{align}
vt(x,V) &= vt(x,S) + vt(x,V \setminus S) \tag{1}\\
        &\le |S \setminus \{x\}| + vt(x,V \setminus S). \tag{2}
\end{align}

(ii) If $WCC(x,S) = 0$, then at least one of the following
three identities holds: $t(x,V) = 0$, $vt(x,V) = 0$, or
$t(x,S) = 0$. Now, each one of these conditions implies
$t(x,S) = 0$. Reciprocally, by definition, if $t(x,S) = 0$, then
$WCC(x,S) = 0$.

(iii) Assume $WCC(x,S) = 1$. By (ii), $t(x,S) \neq 0$. Hence,
there exists an edge $\{y,z\}$ with $y \in S \setminus \{x\}$ and
$z \in S \setminus \{x\}$ forming a triangle with $x$. Then
$|S \setminus \{x\}| \ge 2$. As the two fractions defining
$WCC(x,S)$ are $\le 1$, the condition $WCC(x,S) = 1$ implies
that both fractions are 1. The condition $t(x,V) = t(x,S)$ is
equivalent to $vt(x,V \setminus S) = 0$. Since $WCC(x,S) = 1$,
the inequality (2) is an equality, and we have
$vt(x,S) = |S \setminus \{x\}|$.

Reciprocally, the condition $vt(x,V \setminus S) = 0$ implies
$t(x,S) = t(x,V)$ and $vt(x,S) = vt(x,V)$. As
$vt(x,V) = vt(x,S) = |S \setminus \{x\}| \ge 2$, we have that both
fractions in the definition of $WCC(x,S)$ have denominator
$\neq 0$ and both fractions are 1. Therefore, $WCC(x,S) = 1$. □
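
The three cases of Proposition 1 can be checked on a small graph with the sketch below. It assumes the product form of $WCC(x,S)$ used in the proof of Theorem 2, takes $t(x,S)$ as the number of triangles that $x$ closes with two vertices of $S$, and takes $vt(x,S)$ as the number of vertices of $S$ sharing at least one triangle with $x$ (so that identity (1) holds). The function names and the toy graph are hypothetical.

```python
from itertools import combinations

def t(adj, x, S):
    """Number of triangles that x closes with two vertices of S."""
    nbrs = (adj[x] & set(S)) - {x}
    return sum(1 for y, z in combinations(nbrs, 2) if z in adj[y])

def vt(adj, x, S):
    """Number of vertices of S sharing at least one triangle with x.

    The third vertex of the triangle may be anywhere in the graph; the graph
    is assumed to have no self loops.
    """
    return sum(1 for y in set(S) - {x} if y in adj[x] and (adj[x] & adj[y]))

def wcc_vertex(adj, x, S):
    """WCC(x, S) for a vertex x and a vertex set S (assumed formula, see lead-in)."""
    V = set(adj)
    S = set(S)
    t_all = t(adj, x, V)
    if t_all == 0:
        return 0.0
    denom = len(S - {x}) + vt(adj, x, V - S)
    return (t(adj, x, S) / t_all) * (vt(adj, x, V) / denom)

# Hypothetical toy graph: two triangles {0, 1, 2} and {2, 3, 4} sharing vertex 2.
bowtie = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
S = {0, 1, 2}
inner = wcc_vertex(bowtie, 0, S)     # vertex 0 only forms triangles inside S
shared = wcc_vertex(bowtie, 2, S)    # vertex 2 also forms a triangle outside S
alone = wcc_vertex(bowtie, 2, {2})   # no triangle fits inside a singleton
```

Vertex 0 satisfies the conditions of case (iii), vertex 2 loses half of its triangles to the outside, and the singleton illustrates case (ii).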

B. PROOF OF PROPOSITION 2 

Proof. The proofs are a consequence of Proposition 1.
(i) Since $0 \le WCC(x,S) \le 1$ $\forall x \in S$, then
$0 \le WCC(S) \le 1$. (ii) $WCC(S) = 0$ implies that
$\forall x \in S$, $WCC(x,S) = 0$. Since the condition for
$WCC(x,S) = 0$ is that $t(x,S) = 0$, then $WCC(S) = 0$
implies that $S$ has no triangles. (iii) $WCC(S) = 1$ implies
that for all $x \in S$, $WCC(x,S) = 1$. By Proposition 1 (iii),
this implies that every vertex $x \in S$ satisfies
$vt(x,V \setminus S) = 0$ and $vt(x,S) = |S \setminus \{x\}|$.
Thus, all the vertices $x \in S$ form triangles only with, and
with all, the other vertices in $S$, which implies having an
edge with all the vertices in $S$, and hence forming a clique. □

C. PROOF OF THEOREM 1 

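Proof. Let $N$ be the set of neighbors of $v$.

(i) For $x \in V$, we have $WCC(x,V) = vt(x,V)/r$. Now,

\[
vt(x,V) =
\begin{cases}
(r-1)p     & \text{if } x \in V_r \setminus N;\\
(r-1)p + 1 & \text{if } x \in N;\\
d          & \text{if } x \in \{v\}.
\end{cases}
\]

Then

\[
(r+1)\,WCC(\mathcal{P}_1)
  = (r-d)\frac{(r-1)p}{r} + d\,\frac{(r-1)p+1}{r} + \frac{d}{r}
  = (r-1)p + \frac{2d}{r}.
\]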

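(ii) For $x \in V_r \setminus N$,

\[
WCC(x,V_r) = \frac{vt(x,V)}{r-1} = \frac{(r-1)p}{r-1} = p.
\]

For $x \in N$, we have

\begin{align*}
t(x,V_r) &= \binom{r-1}{2}p^3;\\
t(x,V) &= \binom{r-1}{2}p^3 + (d-1)p;\\
vt(x,V) &= (r-1)p + 1;\\
|V_r \setminus \{x\}| + vt(x,V \setminus V_r) &= (r-1) + 1 = r.
\end{align*}

Moreover, $WCC(v,\{v\}) = 0$. Then,

\[
(r+1)\,WCC(\mathcal{P}_2)
  = (r-d)p + \frac{d}{r}\cdot
    \frac{((r-1)p+1)(r-1)(r-2)p^2}{(r-1)(r-2)p^2 + 2(d-1)}.
\]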

(iii) We have

\[
(r+1)\left(WCC(\mathcal{P}_1) - WCC(\mathcal{P}_2)\right)
  = p(d-1) + \frac{2d}{r} - \frac{d}{r}\cdot
    \frac{((r-1)p+1)(r-1)(r-2)p^2}{(r-1)(r-2)p^2 + 2(d-1)},
\]

and the condition $WCC(\mathcal{P}_1) - WCC(\mathcal{P}_2) > 0$ is
equivalent to the condition

\[
a d^2 + b d + c > 0, \tag{3}
\]

where

\begin{align*}
a &= 2(2+pr),\\
b &= p^2(p+1)r^2 - p(3p^2+3p+4)r + 2p^3 + 2p^2 - 4,\\
c &= -p^3r^3 + 3p^3r^2 + 2p(1-p^2)r.
\end{align*}

For short, we denote by $O(r^n)$ a polynomial expression
of degree at most $n$. Then, the greatest root of the quadratic
in (3) is

\[
d = \frac{-p^2(1+p)r^2 + O(r) + \sqrt{p^4(p^2+2p+9)r^4 + O(r^3)}}{4(2+pr)},
\]

and we get

\[
\lim_{r \to \infty} \frac{d}{r}
  = \frac{-p^2(1+p) + p^2\sqrt{p^2+2p+9}}{4p}.
\]

Thus, for a large enough $r$, the condition

\[
d > r\,p\left(\sqrt{p^2+2p+9} - (1+p)\right)/4
\]

is equivalent to $WCC(\mathcal{P}_1) > WCC(\mathcal{P}_2)$. □

Note that the function
$p \mapsto p\left(\sqrt{p^2+2p+9} - (1+p)\right)/4$
is increasing in $p$. A greater value of $p$ means a greater
cohesion in $G$, and then a greater value of $d/r$ is needed for
$WCC(\mathcal{P}_1)$ to be greater than $WCC(\mathcal{P}_2)$.

In the case of Corollary 1, $p = 1$, thus
$d/r > (\sqrt{3}-1)/2 \approx 0.37$.
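
The algebra in part (iii) can be sanity-checked numerically. The sketch below is only a check under stated assumptions, not part of the proof: it evaluates the difference $(r+1)(WCC(\mathcal{P}_1) - WCC(\mathcal{P}_2))$ and the quadratic $ad^2 + bd + c$ in exact rational arithmetic, relying on the fact that clearing the denominators $r$ and $(r-1)(r-2)p^2 + 2(d-1)$ in the former should give the latter. The function names are hypothetical.

```python
from fractions import Fraction
from math import sqrt

def wcc_gap(r, d, p):
    """(r + 1) * (WCC(P1) - WCC(P2)) as written in part (iii)."""
    num = ((r - 1) * p + 1) * (r - 1) * (r - 2) * p ** 2
    den = (r - 1) * (r - 2) * p ** 2 + 2 * (d - 1)
    return p * (d - 1) + Fraction(2 * d, r) - Fraction(d, r) * num / den

def quadratic(r, d, p):
    """Left-hand side of condition (3): a*d^2 + b*d + c."""
    a = 2 * (2 + p * r)
    b = (p ** 2 * (p + 1) * r ** 2 - p * (3 * p ** 2 + 3 * p + 4) * r
         + 2 * p ** 3 + 2 * p ** 2 - 4)
    c = -(p ** 3) * r ** 3 + 3 * p ** 3 * r ** 2 + 2 * p * (1 - p ** 2) * r
    return a * d ** 2 + b * d + c

def asymptotic_threshold(p):
    """Limit of d / r above which keeping v inside the community is better."""
    return p * (sqrt(p * p + 2 * p + 9) - (1 + p)) / 4

def gap_matches_quadratic(r, d, p):
    """Check that the gap times its (positive) denominators equals the quadratic."""
    den = (r - 1) * (r - 2) * p ** 2 + 2 * (d - 1)
    return wcc_gap(r, d, p) * r * den == quadratic(r, d, p)

clique_threshold = asymptotic_threshold(1)
```

Since both denominators are positive for $d \ge 1$ and $r \ge 3$, the two expressions have the same sign, and `asymptotic_threshold(1)` evaluates the bound of Corollary 1.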

D. PROOF OF THEOREM 2 

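Proof. Let $S = S_1 \cup S_2$. For $x \in S_i$, $i \in \{1,2\}$, we
have $t(x,S_i) = t(x,S)$, $vt(x,V \setminus S_i) = vt(x,V \setminus S)$ and
$|S_i \setminus \{x\}| \le |S \setminus \{x\}|$. Then,

\begin{align*}
WCC(x,S) &= \frac{t(x,S)}{t(x,V)} \cdot
            \frac{vt(x,V)}{|S \setminus \{x\}| + vt(x,V \setminus S)}\\
         &\le \frac{t(x,S_i)}{t(x,V)} \cdot
            \frac{vt(x,V)}{|S_i \setminus \{x\}| + vt(x,V \setminus S_i)}\\
         &= WCC(x,S_i).
\end{align*}

Therefore,

\begin{align*}
|S| \cdot WCC(\{S_1,S_2\})
  &= |S_1| \cdot WCC(S_1) + |S_2| \cdot WCC(S_2)\\
  &= \sum_{x \in S_1} WCC(x,S_1) + \sum_{x \in S_2} WCC(x,S_2)\\
  &\ge \sum_{x \in S} WCC(x,S),
\end{align*}

which implies

\[
WCC(\{S_1,S_2\}) \ge \frac{1}{|S|}\sum_{x \in S} WCC(x,S)
  = WCC(S) = WCC(\{S_1 \cup S_2\}). \qquad □
\]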



E. PROOF OF THEOREM 3 

Proof. (i) For the $r-1$ vertices $x \in K_r \setminus \{t\}$, we have
$WCC(x,V) = vt(x,V)/(n-1) = (r-1)/(n-1)$. For the
vertex $t$, we have $WCC(t,V) = 1$. Finally, for the $s-1$
vertices $x \in K_s \setminus \{t\}$, we have $WCC(x,V) = (s-1)/(n-1)$.
As $n-1 = r+s-2$, we obtain the formula (4).

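(ii) For the $r-1$ vertices $x \in K_r \setminus \{t\}$ we have
$WCC(x,K_r) = 1$. For the vertex $t$, we have

\[
WCC(t,K_r)
  = \frac{\binom{r-1}{2}}{\binom{r-1}{2} + \binom{s-1}{2}} \cdot \frac{n-1}{n-1}
  = \frac{(r-1)(r-2)}{(r-1)(r-2) + (s-1)(s-2)}.
\]

For the $s-1$ vertices $x \in K_s \setminus \{t\}$, we have

\[
WCC(x,K_s \setminus \{t\})
  = \frac{\binom{s-2}{2}}{\binom{s-1}{2}} \cdot \frac{s-1}{s-1}
  = \frac{(s-2)(s-3)}{(s-1)(s-2)}.
\]

This gives the formula (5).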
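(iii) For $x \in K_r \setminus \{t\}$,

\[
WCC(x,K_r \setminus \{t\})
  = \frac{\binom{r-2}{2}}{\binom{r-1}{2}} \cdot \frac{r-1}{r-1}
  = \frac{(r-2)(r-3)}{(r-1)(r-2)};
\]

for vertex $t$,

\[
WCC(t,\{t\})
  = \frac{\binom{0}{2}}{\binom{n-1}{2}} \cdot \frac{n-1}{0 + r-1 + s-1}
  = \frac{(1-1)(1-2)}{(r+s-2)(r+s-3)}
  = 0;
\]

for $x \in K_s \setminus \{t\}$,

\[
WCC(x,K_s \setminus \{t\}) = \frac{(s-2)(s-3)}{(s-1)(s-2)}.
\]

This implies (6).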

(iv) Define $f_1(r,s) = n \cdot WCC(\mathcal{P}_1)$,
$f_2(r,s) = n \cdot WCC(\mathcal{P}_2)$, and
$f_3(r,s) = n \cdot WCC(\mathcal{P}_3)$. The expressions of these
functions are those in (4), (5) and (6), respectively. The goal
is to show that for all integer values $r, s$ with $r \ge s \ge 4$
the inequality $f_3(r,s) < f_2(r,s)$ holds. Clearly, the first
summand of $f_3(r,s)$ is smaller than the first summand of
$f_2(r,s)$, and the third summands are equal. Then, to prove
$f_3(r,s) < f_2(r,s)$ it is sufficient to compare the second
summands. As the second summand of $f_3(r,s)$ is $0$, then
$f_3(r,s) < f_2(r,s)$.

(v) We shall prove $f_2(r,s) - f_1(r,s) > 0$ for $n \ge 7$ and
$4 \le r \le n-3$. We have $s = n-r+1$ and

\[
f_2(r,s) - f_1(r,s) > n - 4 - \frac{(r-1)^2 + (n-r)^2}{n-1}
  = \frac{-2r^2 + (2+2n)r - 5n + 3}{n-1}.
\]

Hence $f_2(r,s) - f_1(r,s)$ is positive whenever the polynomial
function $-2r^2 + 2(n+1)r - 5n + 3$ is nonnegative. This is a
concave function of $r$ with roots:

\[
r_1 = \tfrac{1}{2}\left(n + 1 - \sqrt{n^2 - 8n + 7}\right); \qquad
r_2 = \tfrac{1}{2}\left(n + 1 + \sqrt{n^2 - 8n + 7}\right).
\]

Now, for $n \ge 7$, we have $r_1 \le 4$ and $r_2 \ge n-3$. Therefore,
for each $r \in \{4, \ldots, n-3\}$ the polynomial is nonnegative
and we have $f_2(r,s) - f_1(r,s) > 0$. □
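
Parts (iv) and (v) can also be checked exhaustively for small clique sizes. The sketch below rebuilds $f_1$, $f_2$ and $f_3$ by summing the per-vertex values derived in parts (i)-(iii), which is an assumption about the exact shape of formulas (4)-(6), and uses exact fractions to avoid rounding issues. The function names are hypothetical.

```python
from fractions import Fraction

def f1(r, s):
    """n * WCC(P1): both cliques and the shared vertex t in one community."""
    n = r + s - 1
    return Fraction((r - 1) ** 2 + (s - 1) ** 2, n - 1) + 1

def f2(r, s):
    """n * WCC(P2): communities K_r (containing t) and K_s without t."""
    shared = Fraction((r - 1) * (r - 2), (r - 1) * (r - 2) + (s - 1) * (s - 2))
    rest = (s - 1) * Fraction((s - 2) * (s - 3), (s - 1) * (s - 2))
    return (r - 1) + shared + rest

def f3(r, s):
    """n * WCC(P3): t isolated, so each clique loses t and t contributes 0."""
    left = (r - 1) * Fraction((r - 2) * (r - 3), (r - 1) * (r - 2))
    right = (s - 1) * Fraction((s - 2) * (s - 3), (s - 1) * (s - 2))
    return left + right

def lower_bound(r, s):
    """Lower bound on f2 - f1 used in part (v), with n = r + s - 1."""
    n = r + s - 1
    return Fraction(-2 * r ** 2 + 2 * (n + 1) * r - 5 * n + 3, n - 1)

sizes = [(r, s) for r in range(4, 40) for s in range(4, 40)]
p2_is_best = all(f3(r, s) < f2(r, s) and f1(r, s) < f2(r, s) for r, s in sizes)
```

For $r = s = 4$ the bound of part (v) is exactly zero while $f_2 - f_1 = 1/2$, which shows why the strict inequality has to come from the dropped second summand.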



